BN layer, non-linear activation layer, and max-pooling layer, which we omit for simplicity. The output $\hat{a}_{out}$ is then binarized to $b_{\hat{a}_{out}}$ by the sign function. The fundamental objective of BNNs is to calculate $\hat{w}$ such that the weights before and after binarization are as close as possible, minimizing the binarization effect. Following [77], we define the reconstruction error as
$$\mathcal{L}_R(\hat{w}, \beta) = \left\|\hat{w} - \beta \circ b_{\hat{w}}\right\|_2^2. \tag{4.23}$$
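To make Eq. 4.23 concrete, below is a minimal PyTorch sketch of sign binarization with a channel-wise scale; `binarize` and `reconstruction_error` are illustrative names, and choosing $\beta$ as the per-channel mean of absolute values follows the closed-form minimizer of the $\ell_2$ reconstruction error given in [77].

```python
import torch

def binarize(w_hat: torch.Tensor):
    """Binarize real-valued weights w_hat (out_ch, in_ch, kH, kW) with a
    channel-wise scale beta, as in Eq. 4.23. For the L2 reconstruction
    error, the optimal beta is the mean absolute value per output
    channel, the XNOR-Net solution [77]."""
    b_w = torch.sign(w_hat)   # in {-1, 0, +1}; BNNs usually map 0 to +1
    beta = w_hat.abs().mean(dim=(1, 2, 3), keepdim=True)  # closed-form scale
    return b_w, beta

def reconstruction_error(w_hat, b_w, beta):
    """L_R(w_hat, beta) = ||w_hat - beta o b_w||_2^2 (Eq. 4.23)."""
    return ((w_hat - beta * b_w) ** 2).sum()

w_hat = torch.randn(64, 32, 3, 3)  # a toy 1-bit conv weight tensor
b_w, beta = binarize(w_hat)
print(reconstruction_error(w_hat, b_w, beta).item())
```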
Based on the above derivation, the vanilla direct BNAS [36, 114] can be defined as
$$\max_{\hat{w} \in \mathcal{W},\, \hat{\alpha} \in \mathcal{A},\, \beta \in \mathbb{R}^+} f_b(\hat{w}, \hat{\alpha}, \beta), \tag{4.24}$$
where $b_{\hat{w}} = \mathrm{sign}(\hat{w})$ is used for inference and $\hat{\alpha}$ is a neural architecture with binary weights. The prior direct BNAS [36] learns the architecture from an objective of the form
$$\max_{\hat{w} \in \mathcal{W},\, \hat{\alpha} \in \mathcal{A},\, \beta \in \mathbb{R}^+} \tilde{f}_b(\hat{w}, \hat{\alpha}, \beta) = \sum_{n=1}^{N} \hat{p}_n(\hat{w}, \hat{\alpha}, \beta) \log\big(\hat{p}_n(X)\big), \tag{4.25}$$
where we use notations similar to those of Eq. 4.21. Equation 4.25 shows that the vanilla direct BNAS focuses only on the binary search space under the supervision of the cross-entropy loss, which is less effective because the search process is not exhaustive [24].
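The relaxation behind Eqs. 4.24 and 4.25 can be sketched in code. The snippet below is a hedged illustration rather than the chapter's implementation: it assumes a DARTS-style soft mixture over candidate operations, binarizes the latent weights $\hat{w}$ with a straight-through estimator, and optimizes the architecture parameters $\hat{\alpha}$ jointly under cross-entropy supervision; `BinaryConv` and `MixedBinaryOp` are hypothetical names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryConv(nn.Module):
    """Conv whose weights are binarized on the forward pass (Eq. 4.23);
    a straight-through estimator passes gradients to the latent weights."""
    def __init__(self, cin, cout):
        super().__init__()
        self.w_hat = nn.Parameter(0.1 * torch.randn(cout, cin, 3, 3))

    def forward(self, x):
        beta = self.w_hat.abs().mean(dim=(1, 2, 3), keepdim=True)
        b_w = torch.sign(self.w_hat)
        # forward uses beta * b_w; backward sees the identity w.r.t. w_hat
        w = self.w_hat + (beta * b_w - self.w_hat).detach()
        return F.conv2d(x, w, padding=1)

class MixedBinaryOp(nn.Module):
    """A DARTS-style mixed edge over binary candidate ops, weighted by
    softmax(alpha_hat): the continuous relaxation of the binary search
    space used in Eqs. 4.24-4.25."""
    def __init__(self, cin, cout, n_ops=3):
        super().__init__()
        self.ops = nn.ModuleList(BinaryConv(cin, cout) for _ in range(n_ops))
        self.alpha_hat = nn.Parameter(1e-3 * torch.randn(n_ops))

    def forward(self, x):
        weights = F.softmax(self.alpha_hat, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# One toy search step: cross-entropy supervision updates both w_hat and alpha_hat.
x = torch.randn(2, 16, 8, 8)
edge = MixedBinaryOp(16, 16)
logits = edge(x).mean(dim=(2, 3))            # toy classification head
loss = F.cross_entropy(logits, torch.randint(0, 16, (2,)))
loss.backward()
```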
4.4.2 Redefine Child-Parent Framework for Network Binarization
Network binarization computes neural networks with 1-bit weights and activations to approximate the full-precision network, which significantly compresses CNNs. Prior work [287] usually investigates the binarization problem by using the full-precision model to guide the optimization of the binarized model. Based on this investigation, we reformulate NAS-based network binarization as a Child-Parent model, as shown in Fig. 4.12: the Child and Parent models are the binarized network and its full-precision counterpart, respectively.
Conventional NAS is inefficient due to the complicated reward computation during network training, where a structure is usually evaluated only after the network training converges. Some methods instead evaluate a cell during training. [292] points out that the best choice in the early stages is not necessarily the final optimal one; however, an operation that performs worst in the early stages usually keeps performing badly, and this phenomenon becomes more significant as training proceeds. Based on this observation, we propose a simple yet effective operation-removing process (sketched below), which is the crucial task of the proposed CP model.
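A minimal sketch of such an operation-removing step follows. The `edge_scores` bookkeeping and function name are hypothetical; the actual removal criterion in the CP model is the evaluation measure defined next.

```python
def remove_worst_ops(edge_scores, keep_min=1):
    """Progressively drop the worst-scoring candidate operation on each
    edge, following the observation of [292] that an operation that is
    worst in the early search stages rarely recovers later.

    edge_scores: {edge_id: {op_name: performance_score}} -- hypothetical
    bookkeeping collected while training the supernet."""
    for edge, scores in edge_scores.items():
        if len(scores) > keep_min:
            worst = min(scores, key=scores.get)
            del scores[worst]          # shrink the search space
    return edge_scores

scores = {"edge0": {"conv3x3": 0.61, "skip": 0.55, "maxpool": 0.42}}
remove_worst_ops(scores)               # drops "maxpool", the current worst op
```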
Intuitively, the representation difference between the Child and the Parent, and how well the Child can independently handle its task, are the two main aspects that should be considered when defining a reasonable performance evaluation measure. Based on this analysis, we introduce the Child-Parent framework for binary NAS, which defines the objective as
$$\begin{aligned}
\hat{w}^*, \hat{\alpha}^*, \beta^* &= \operatorname*{argmin}_{\hat{w} \in \hat{\mathcal{W}},\, \hat{\alpha} \in \mathcal{A},\, \beta \in \mathbb{R}^+} \mathcal{L}_{\mathrm{CP\text{-}NAS}}\big(\tilde{f}^P(w, \alpha),\, \tilde{f}^C_b(\hat{w}, \hat{\alpha}, \beta)\big) \\
&= \operatorname*{argmin}_{\hat{w} \in \hat{\mathcal{W}},\, \hat{\alpha} \in \mathcal{A},\, \beta \in \mathbb{R}^+} \tilde{f}^P(w, \alpha) - \tilde{f}^C_b(\hat{w}, \hat{\alpha}, \beta),
\end{aligned} \tag{4.26}$$
where $\tilde{f}^P(w, \alpha)$ denotes the performance of the real-valued Parent model as predefined in Eq. 4.21, and $\tilde{f}^C_b$ is further defined as $\tilde{f}^C_b(\hat{w}, \hat{\alpha}, \beta) = \sum_{n=1}^{N} \hat{p}_n(\hat{w}, \hat{\alpha}, \beta) \log(\hat{p}_n(X))$ following Eq. 4.25. As shown in Eq. 4.26, we propose $\mathcal{L}_{\mathrm{CP\text{-}NAS}}$ to estimate the performance of candidate architectures.
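As one possible reading of Eq. 4.26, the sketch below measures both $\tilde{f}^P$ and $\tilde{f}^C_b$ by the negative cross-entropy on a batch and penalizes the Parent-Child performance gap; treating the Parent as a fixed teacher (detached) is our assumption, not something Eq. 4.26 itself prescribes.

```python
import torch.nn.functional as F

def model_fitness(logits, labels):
    """A proxy for f-tilde in Eqs. 4.21/4.25: the negative cross-entropy
    of the model's predictions, so higher means better performance."""
    return -F.cross_entropy(logits, labels)

def cp_nas_loss(parent_logits, child_logits, labels):
    """L_CP-NAS (Eq. 4.26): the performance gap between the real-valued
    Parent and the 1-bit Child. With the Parent detached as a fixed
    teacher, gradients flow only into the Child, pushing its fitness
    toward the Parent's."""
    f_parent = model_fitness(parent_logits, labels).detach()
    f_child = model_fitness(child_logits, labels)
    return f_parent - f_child
```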